parsermd

implements a C++ parser and abstract syntax tree (AST) for Quarto and R Markdown documents in R.

  • supports manipulating ASTs (filtering, editing, etc.)

  • node classes use S7 for validation and dispatch

  • ability to directly source and render ASTs

  • on-and-off project since COVID; the original use case was to aid in grading for a large machine learning course

  • v0.1.3 is on CRAN, v0.2.0 with full Quarto support on GitHub (CRAN soon*)

  • Quarto examples today, but everything works with R Markdown

hello.qmd

---
title: "Hello, Quarto"
format:
  html:
    self-contained: true
---
  
```{r}
#| label: load-packages
#| include: false

library(tidyverse)
library(palmerpenguins)
```

## Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. 
To learn more about Quarto see <https://quarto.org>.

## Meet the penguins

![](https://raw.githubusercontent.com/quarto-dev/quarto-web/main/docs/get-started/hello/rstudio/lter_penguins.png){style="float:right;" fig-alt="Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst." width="401"}

The `penguins` data from the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins "palmerpenguins R package") 
package contains size measurements for `{r} nrow(penguins)` penguins from three species
observed on three islands in the Palmer Archipelago, Antarctica.

The plot below shows the relationship between flipper and bill lengths of these penguins.

```{r}
#| label: plot-penguins
#| warning: false
#| echo: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```

## Other Quarto features

### Fenced divs

:::{.callout-note}
Note that there are five types of callouts, including: 
`note`, `tip`, `warning`, `caution`, and `important`.
:::

### Markdown code blocks

Some sample python code,

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(
  subplot_kw = {'projection': 'polar'} 
)
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
```

### Short codes

Shortcodes are special markdown directives that generate various types of content,

{{< lipsum 1 >}}

Elements as AST

qmd = parse_qmd("hello.qmd")
qmd |> print(flat = TRUE)
├── YAML [2 fields]
├── Markdown [1 line]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
├── Markdown [1 line]
├── Heading [h2] - Meet the penguins
├── Markdown [7 lines]
├── Chunk [r, 12 lines] - plot-penguins
├── Heading [h2] - Other Quarto features
├── Heading [h3] - Fenced divs
├── Open Fenced div [.callout-note]
├── Markdown [2 lines]
├── Close Fenced div 
├── Heading [h3] - Markdown code blocks
├── Markdown [1 line]
├── Code block [python, 12 lines]
├── Heading [h3] - Short codes
└── Markdown [3 lines]
qmd |> print()
├── YAML [2 fields]
├── Markdown [1 line]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
│   └── Markdown [1 line]
├── Heading [h2] - Meet the penguins
│   ├── Markdown [7 lines]
│   └── Chunk [r, 12 lines] - plot-penguins
└── Heading [h2] - Other Quarto features
    ├── Heading [h3] - Fenced divs
    │   ├── Open Fenced div [.callout-note]
    │   │   └── Markdown [2 lines]
    │   └── Close Fenced div 
    ├── Heading [h3] - Markdown code blocks
    │   ├── Markdown [1 line]
    │   └── Code block [python, 12 lines]
    └── Heading [h3] - Short codes
        └── Markdown [3 lines]

Why hierarchical?

Assuming a hierarchy lets us use a CSS-selector-like approach to target specific nodes based on headings and their descendants,

qmd |> rmd_select(by_section("Meet Quarto"))
├── YAML [2 fields]
└── Heading [h2] - Meet Quarto
    └── Markdown [1 line]
qmd |> rmd_select(by_section("Fenced divs"))
├── YAML [2 fields]
└── Heading [h3] - Fenced divs
    ├── Open Fenced div [.callout-note]
    │   └── Markdown [2 lines]
    └── Close Fenced div 
qmd |> 
  rmd_select(
    by_section(c("Other Quarto features", "*code*"))
  )
├── YAML [2 fields]
└── Heading [h2] - Other Quarto features
    ├── Heading [h3] - Markdown code blocks
    │   ├── Markdown [1 line]
    │   └── Code block [python, 12 lines]
    └── Heading [h3] - Short codes
        └── Markdown [3 lines]
qmd |> 
  rmd_select(
    by_section(
      c("Other Quarto features", "*code*"), 
      keep_parents = FALSE
    ), 
    keep_yaml = FALSE
  )
├── Heading [h3] - Markdown code blocks
│   ├── Markdown [1 line]
│   └── Code block [python, 12 lines]
└── Heading [h3] - Short codes
    └── Markdown [3 lines]

as_document()

ASTs and nodes can be converted back to Quarto documents,

qmd |> 
  rmd_select(by_section("Meet Quarto")) |>
  as_document() |>
  cat(sep = "\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

## Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
qmd |> 
  rmd_select(by_section("Fenced divs")) |>
  as_document() |>
  cat(sep = "\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

### Fenced divs

::: {.callout-note}

Note that there are five types of callouts, including: 
`note`, `tip`, `warning`, `caution`, and `important`.


:::
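
Since the output of as_document() is passed to cat(sep = "\n") above, it can presumably be written to a file in the same way. The following is a minimal sketch, assuming parsermd v0.2.0 is installed and that as_document() returns a character vector; the tiny demo document and the temporary file names are made up for illustration.

```r
library(parsermd)

# Write a tiny demo document to a temporary file so the sketch is self-contained
# (the contents are hypothetical, not the hello.qmd shown earlier)
src = tempfile(fileext = ".qmd")
writeLines(c(
  "---",
  "title: Demo",
  "---",
  "",
  "## Meet Quarto",
  "",
  "Some text.",
  "",
  "## Other",
  "",
  "More text."
), src)

qmd = parse_qmd(src)

# Keep only the "Meet Quarto" section (plus the YAML) and convert back to text
doc = qmd |>
  rmd_select(by_section("Meet Quarto")) |>
  as_document()

# Assumption: doc is a character vector, so base R writeLines() can save it
out = tempfile(fileext = ".qmd")
writeLines(doc, out)
```

The written file can then be opened or rendered like any other Quarto document.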

Additional selectors

rmd_select() and its helpers are built on tidyselect; multiple selectors are combined with OR, so the result is their union.

  • by_section()

  • has_type()

  • has_label()

  • has_heading()

  • has_option()

  • has_shortcode()

  • by_fdiv()

qmd |> rmd_select(has_label("*p*"))
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
└── Chunk [r, 12 lines] - plot-penguins
qmd |> rmd_select(
  has_heading("Meet Quarto"):has_heading("Meet the penguins")
)
├── YAML [2 fields]
├── Heading [h2] - Meet Quarto
│   └── Markdown [1 line]
└── Heading [h2] - Meet the penguins
qmd |> rmd_select(has_type(c("rmd_yaml", "rmd_chunk")))
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
└── Chunk [r, 12 lines] - plot-penguins
qmd |> rmd_select(!has_type("rmd_markdown"))
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
├── Heading [h2] - Meet the penguins
│   └── Chunk [r, 12 lines] - plot-penguins
└── Heading [h2] - Other Quarto features
    ├── Heading [h3] - Fenced divs
    │   ├── Open Fenced div [.callout-note]
    │   └── Close Fenced div 
    ├── Heading [h3] - Markdown code blocks
    │   └── Code block [python, 12 lines]
    └── Heading [h3] - Short codes
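
The union behavior of multiple selectors can be sketched as follows. This assumes parsermd v0.2.0 is installed; the demo document is hypothetical and only has_heading(), has_label(), and as_document() from the examples above are used.

```r
library(parsermd)

# Hypothetical demo document with two sections and two labeled chunks
src = tempfile(fileext = ".qmd")
writeLines(c(
  "---", "title: Demo", "---", "",
  "## Setup", "",
  "```{r}", "#| label: load", "x = 1", "```", "",
  "## Analysis", "",
  "Text here.", "",
  "```{r}", "#| label: plot", "plot(x)", "```"
), src)

qmd = parse_qmd(src)

# Two selectors in one call: a node is kept if EITHER selector matches it,
# so this should keep the "Analysis" heading and the "load" chunk
# (the YAML is kept by default, as in the examples above)
both = qmd |>
  rmd_select(has_heading("Analysis"), has_label("load"))

both_doc = as_document(both)
cat(both_doc, sep = "\n")
```

Nodes matched by neither selector, such as the markdown text and the "plot" chunk, should be dropped.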

Rendering

ASTs can also be rendered directly,

qmd |> render("hello_quarto")
qmd |> rmd_select(has_type("rmd_chunk")) |> render("hello_quarto_code")

Modifying ASTs

rmd_modify() is a recent addition that modifies the selected nodes of an AST. Its arguments are a node-modifying function followed by one or more rmd_select() helper functions that choose which nodes the function is applied to.

qmd |> 
  rmd_select(has_type("rmd_chunk")) |>
  as_document() |>
  cat(sep="\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

```{r}
#| label: load-packages


library(tidyverse)
library(palmerpenguins)
```

```{r}
#| label: plot-penguins
#| warning: false
#| echo: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```
qmd |>
  rmd_select(has_type("rmd_chunk")) |>
  rmd_modify(
    function(x) {
      rmd_node_options(x) = list(echo=TRUE, message=FALSE)
      x
    },
    has_type("rmd_chunk")
  ) |>
  as_document() |>
  cat(sep="\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

```{r}
#| label: load-packages
#| echo: true
#| message: false


library(tidyverse)
library(palmerpenguins)
```

```{r}
#| label: plot-penguins
#| warning: false
#| echo: true
#| message: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```

qmd |>
  rmd_select(has_type("rmd_chunk")) |>
  rmd_modify(
    function(x) {
      rmd_node_options(x) = list(echo=TRUE, message=FALSE)
      x
    },
    has_type("rmd_chunk")
  ) |>
  as_document() |>
  render("hello_quarto_code2")

Example Workflow


One file to rule them all

Problem statement

I distribute assignments as GitHub repos that typically contain a README.md and hw1.qmd file.

I inevitably end up having to maintain both hw1/ and hw1-key/ versions of the assignment.

  • Different repos for different audiences: students vs TAs respectively

  • Repos have a tendency to drift over time

  • Single repo with student scaffolding and solution code is ideal for maintenance but clunky for actual work

hw1.qmd

---
title: "Homework 3 - Data Analysis with R"
author: "Your Name"
date: "Due: Friday, March 15, 2024"
format: html
execute:
  warning: false
  message: false
---

## Setup

Load the required packages for this assignment:

```{r setup}
library(tidyverse)
library(palmerpenguins)
```

## Exercise 1: Basic Data Exploration

Examine the `penguins` dataset from the `palmerpenguins` package. Your task is to create a summary of the dataset that shows the number of observations and variables, and identify any missing values.

```{r ex1-student}
# Write your code here to:
# 1. Display the dimensions of the penguins dataset
# 2. Show the structure of the dataset
# 3. Count missing values in each column

```

```{r ex1-key}
# Solution: Basic data exploration
# 1. Display dimensions
cat("Dataset dimensions:", dim(penguins), "\n")
cat("Rows:", nrow(penguins), "Columns:", ncol(penguins), "\n\n")

# 2. Show structure
str(penguins)

# 3. Count missing values
cat("\nMissing values by column:\n")
penguins %>%
  summarise(across(everything(), ~ sum(is.na(.))))
```

## Exercise 2: Data Visualization

Create a scatter plot showing the relationship between flipper length and body mass for penguins. Color the points by species and add appropriate labels and a title.

```{r ex2-student}
# Create a scatter plot with:
# - x-axis: flipper_length_mm
# - y-axis: body_mass_g
# - color by species
# - add appropriate labels and title

ggplot(data = penguins, aes(x = ___, y = ___)) +
  geom_point(aes(color = ___)) +
  labs(
    title = "___",
    x = "___",
    y = "___"
  )
```

```{r ex2-key}
# Solution: Scatter plot of flipper length vs body mass
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.8, size = 2) +
  labs(
    title = "Penguin Flipper Length vs Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```

## Exercise 3: Statistical Analysis

Calculate summary statistics for bill length by species. Create a table showing the mean, median, standard deviation, and count for each species.

```{r ex3-student}
# Calculate summary statistics for bill_length_mm by species
# Include: mean, median, standard deviation, and count
# Remove missing values before calculating

penguins %>%
  # Add your code here

```

```{r ex3-key}
# Solution: Summary statistics for bill length by species
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  group_by(species) %>%
  summarise(
    count = n(),
    mean_bill_length = round(mean(bill_length_mm), 2),
    median_bill_length = round(median(bill_length_mm), 2),
    sd_bill_length = round(sd(bill_length_mm), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_bill_length))
```

## Exercise 4: Advanced Data Manipulation

Filter the dataset to include only penguins with complete data (no missing values), then create a new variable called `bill_ratio` that represents the ratio of bill length to bill depth. Finally, identify which species has the highest average bill ratio.

```{r ex4-student}
# Step 1: Filter for complete cases
# Step 2: Create bill_ratio variable (bill_length_mm / bill_depth_mm)
# Step 3: Calculate average bill_ratio by species
# Step 4: Identify species with highest average ratio

```

```{r ex4-key}
# Solution: Advanced data manipulation
complete_penguins = penguins %>%
  # Remove rows with any missing values
  filter(complete.cases(.)) %>%
  # Create bill_ratio variable
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

# Calculate average bill ratio by species
bill_ratio_summary = complete_penguins %>%
  group_by(species) %>%
  summarise(
    avg_bill_ratio = round(mean(bill_ratio), 3),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_bill_ratio))

print(bill_ratio_summary)

# Identify species with highest average bill ratio
highest_ratio_species = bill_ratio_summary %>%
  slice_max(avg_bill_ratio, n = 1) %>%
  pull(species)

cat("\nSpecies with highest average bill ratio:", as.character(highest_ratio_species))
```

## Bonus Exercise: Conditional Logic

Write a function that categorizes penguins as "small", "medium", or "large" based on their body mass. Use the following criteria:
- Small: body mass < 3500g
- Medium: body mass between 3500g and 4500g  
- Large: body mass > 4500g

Apply this function to create a new column and create a summary table.

```{r bonus-student}
# Create a function to categorize penguins by size
categorize_size = function(mass) {
  # Add your conditional logic here
  
}

# Apply the function and create summary
```

```{r bonus-key}
# Solution: Conditional logic for size categorization
categorize_size = function(mass) {
  case_when(
    is.na(mass) ~ "Unknown",
    mass < 3500 ~ "Small",
    mass >= 3500 & mass <= 4500 ~ "Medium",
    mass > 4500 ~ "Large"
  )
}

# Apply the function and create summary
penguins_with_size = penguins %>%
  mutate(size_category = categorize_size(body_mass_g))

# Create summary table
size_summary = penguins_with_size %>%
  count(species, size_category) %>%
  pivot_wider(names_from = size_category, values_from = n, values_fill = 0)

print(size_summary)

# Overall size distribution
penguins_with_size %>%
  count(size_category) %>%
  mutate(percentage = round(n / sum(n) * 100, 1))
```
(hw1 = parse_qmd("hw1.qmd"))
├── YAML [5 fields]
├── Heading [h2] - Setup
│   ├── Markdown [1 line]
│   └── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│   ├── Markdown [1 line]
│   ├── Chunk [r, 5 lines] - ex1-student
│   └── Chunk [r, 12 lines] - ex1-key
├── Heading [h2] - Exercise 2: Data Visualization
│   ├── Markdown [1 line]
│   ├── Chunk [r, 13 lines] - ex2-student
│   └── Chunk [r, 11 lines] - ex2-key
├── Heading [h2] - Exercise 3: Statistical Analysis
│   ├── Markdown [1 line]
│   ├── Chunk [r, 7 lines] - ex3-student
│   └── Chunk [r, 12 lines] - ex3-key
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│   ├── Markdown [1 line]
│   ├── Chunk [r, 5 lines] - ex4-student
│   └── Chunk [r, 25 lines] - ex4-key
└── Heading [h2] - Bonus Exercise: Conditional Logic
    ├── Markdown [6 lines]
    ├── Chunk [r, 7 lines] - bonus-student
    └── Chunk [r, 25 lines] - bonus-key

Versions

Student

hw1 |>
  rmd_select(
    !has_label("*-key")
  ) |>
  rmd_modify(
    function(x) {
      rmd_node_label(x) = stringr::str_remove(rmd_node_label(x), "-student")
      x
    },
    has_label("*-student")
  )
├── YAML [5 fields]
├── Heading [h2] - Setup
│   ├── Markdown [1 line]
│   └── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│   ├── Markdown [1 line]
│   └── Chunk [r, 5 lines] - ex1
├── Heading [h2] - Exercise 2: Data Visualization
│   ├── Markdown [1 line]
│   └── Chunk [r, 13 lines] - ex2
├── Heading [h2] - Exercise 3: Statistical Analysis
│   ├── Markdown [1 line]
│   └── Chunk [r, 7 lines] - ex3
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│   ├── Markdown [1 line]
│   └── Chunk [r, 5 lines] - ex4
└── Heading [h2] - Bonus Exercise: Conditional Logic
    ├── Markdown [6 lines]
    └── Chunk [r, 7 lines] - bonus

TA

hw1 |>
  rmd_select(
    has_heading(c("Exercise *", "Bonus*")),
    has_label(c("*-key", "setup"))
  ) |>
  rmd_modify(
    function(x) {
      rmd_node_options(x) = list(include = FALSE)
      x
    },
    has_label("setup")
  )
├── YAML [5 fields]
├── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│   └── Chunk [r, 12 lines] - ex1-key
├── Heading [h2] - Exercise 2: Data Visualization
│   └── Chunk [r, 11 lines] - ex2-key
├── Heading [h2] - Exercise 3: Statistical Analysis
│   └── Chunk [r, 12 lines] - ex3-key
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│   └── Chunk [r, 25 lines] - ex4-key
└── Heading [h2] - Bonus Exercise: Conditional Logic
    └── Chunk [r, 25 lines] - bonus-key
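
Putting the two versions together, one possible way to generate both files from a single source is sketched below. It assumes parsermd v0.2.0 is installed and that as_document() returns a character vector; the miniature homework file, the output directory, and the output file names are hypothetical, and base R sub() stands in for stringr::str_remove().

```r
library(parsermd)

# Hypothetical miniature homework with one student chunk and one key chunk
src = tempfile(fileext = ".qmd")
writeLines(c(
  "---", "title: HW", "---", "",
  "## Exercise 1", "",
  "Do the thing.", "",
  "```{r ex1-student}", "# Write your code here", "```", "",
  "```{r ex1-key}", "answer = 42", "```"
), src)

hw = parse_qmd(src)

# Student version: drop the key chunks, then strip the "-student" suffix
student = hw |>
  rmd_select(!has_label("*-key")) |>
  rmd_modify(
    function(x) {
      rmd_node_label(x) = sub("-student$", "", rmd_node_label(x))
      x
    },
    has_label("*-student")
  ) |>
  as_document()

# TA version: keep the exercise headings and the key chunks only
key = hw |>
  rmd_select(has_heading("Exercise *"), has_label("*-key")) |>
  as_document()

# Write both versions next to each other (the file names are an assumption)
out_dir = tempfile("hw1-")
dir.create(out_dir)
writeLines(student, file.path(out_dir, "hw1.qmd"))
writeLines(key, file.path(out_dir, "hw1-key.qmd"))
```

Because both files are derived from the same source, the student and TA versions cannot drift apart.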

What’s next?

  • The current version will be going up on CRAN soon (revdep checks still need work, other minor polishing)

  • Building out and documenting interesting use cases

  • Building out tools using this infrastructure

  • Improved ergonomics

Sneak peek - markermd

Reach out